Global Biodiversity Information Facility
Erik Kusch – Machine Readable Nature (MANA)
Natural History Museum, University of Oslo
Living Norway Colloquium | 24th May 2023
Illustration: GBIF data portal
What is GBIF?
Trondheim | 24/05/2023
Intergovernmental network
and research infrastructure
Provides anyone, anywhere,
free and open access to
data about all types of life on
Earth
Voluntary collaboration
through Memorandum of
Understanding (MoU)
Participant nodes,
Secretariat in Copenhagen,
Denmark
WHAT IS GBIF?
https://www.gbif.org
BY THE NUMBERS | 3RD MAY 2023
64 Country Participants
43 Organizational Participants
8 703 Peer-reviewed papers using data
2 309 951 275 Species occurrence records
85 215 Datasets
2 479 Publishers
>119 billion Records downloaded per month in 2022
DATA FROM THE GBIF NETWORK 3rd May 2023
DATASETS IN GBIF
GBIF PARTICIPANT COUNTRIES
https://www.gbif.org/the-gbif-network
Map updated 2023-05-03
BY THE NUMBERS | 3RD MAY 2023 - NORWAY
213 Peer-reviewed papers using data (co-author from Norway)
51 204 452 Species occurrence records (published from Norway)
399 Datasets (published from Norway)
38 Publishers (from Norway)
What Data is in GBIF?
A WINDOW ON EVIDENCE ABOUT WHERE SPECIES HAVE LIVED, AND WHEN
https://www.gbif.org/occurrence/search
Digitized specimens
Observations
Literature
Remote-sensing
Environmental DNA
Common standards (DwC)
Data publishing and indexing
Data discovery and use
DATA RICHNESS LEVELS SUPPORTED BY GBIF
https://www.gbif.org/dataset-classes
M: Dataset metadata. Dataset description, taxonomic/geographic/temporal scope
C: Species checklists. List of taxa, regional or thematic (e.g. invasive, medicinal)
SE: Sampling-event data. Species occurrences and sampling events: dates, coordinates, sampling effort / protocol, abundance
O: Occurrence-only data. Species occurrences: dates, coordinates, basis of record
SOURCES OF DATA IN GBIF: DIGITIZED MUSEUM COLLECTION SPECIMENS
SOURCES OF DATA IN GBIF: TAXONOMIC LITERATURE, OLD AND NEW
Data liberation
SOURCES OF DATA IN GBIF: DNA SEQUENCE-DERIVED OCCURRENCE DATA
MGnify -- https://www.gbif.org/publisher/ab733144-7043-4e88-bd4f-fca7bf858880
SOURCES OF DATA IN GBIF: PEER-REVIEWED PUBLICATIONS
MGnify -- https://www.gbif.org/dataset/b57c2b3d-4b95-425a-a324-e11e81d4caf3
SOURCES OF DATA IN GBIF: CITIZEN SCIENCE OBSERVATIONS
Using GBIF Data
GBIF: MULTIPLE-PURPOSE DATA PUBLISHING SERVICES
Bio-Collections & ecology datasets
Research data portals
Environment directive reporting
portal
A DATA RESOURCE TO SUPPORT RESEARCH AND SUSTAINABLE DEVELOPMENT
Conservation
- Protected areas
- Threatened species
- Invasive species risk
Food Security
- Crop wild relatives
- In situ, ex situ conservation of genetic diversity
- Fisheries planning
Climate change
- Modelling impacts on species ranges
- Adaptation strategies
- Mitigation benefits, risks
Human health
- Disease risk based on occurrence of vectors, hosts, reservoirs
- Medicinal plants
- Hazards e.g. snakebite
PROVIDING BIODIVERSITY EVIDENCE FOR RESEARCH AND POLICY
GBIF Data Considerations
GLOBAL BIODIVERSITY VS. DIGITALLY AVAILABLE DATA
Taxonomic bias towards birds and against insects and small organisms
Image: FL Fawcett in Wheeler, Ann. Entomol. Soc. Am. 1990
Troudet et al., Scientific Reports 2017
1 200 million animals
300 million plants
20 million fungi
16 million bacteria
0.04 million viruses
DATA BIASES
• Taxonomic bias: some species being more reported than others
• Temporal bias: more data from certain periods
• Sampling bias: some areas being sampled more than others
(ART AND SCIENCE) OF TAXONOMY
Taxonomic homonyms: when filtering based on scientific name, check if all the results are in the same part of the taxonomic tree (at least kingdom)
Cuspidaria cuspidata (Olivi, 1792)
Cuspidaria cuspidata (M. Bieb.) Takht.
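The kingdom check for homonyms can be sketched in a few lines of Python. This is a minimal illustration, not a GBIF tool: the records are hand-written stand-ins using Darwin Core-style field names, and the kingdom values assigned to the two Cuspidaria names are an assumption to verify against the GBIF backbone.

```python
from collections import defaultdict

# Hypothetical, hand-written records with Darwin Core-style field names.
# The kingdom values are assumptions for illustration only.
records = [
    {"scientificName": "Cuspidaria cuspidata (Olivi, 1792)",
     "species": "Cuspidaria cuspidata", "kingdom": "Animalia"},
    {"scientificName": "Cuspidaria cuspidata (M. Bieb.) Takht.",
     "species": "Cuspidaria cuspidata", "kingdom": "Plantae"},
    {"scientificName": "Homo sapiens Linnaeus, 1758",
     "species": "Homo sapiens", "kingdom": "Animalia"},
]


def kingdoms_by_name(records):
    """Map each binomial name to the set of kingdoms it appears in."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec["species"]].add(rec["kingdom"])
    return dict(seen)


def find_homonyms(records):
    """Return the names that occur in more than one kingdom, sorted."""
    by_name = kingdoms_by_name(records)
    return sorted(name for name, kingdoms in by_name.items() if len(kingdoms) > 1)


print(find_homonyms(records))
```

Any name returned by `find_homonyms` needs a closer look before the records are pooled in one analysis.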
DATA ISSUES AND FLAGS
• Suite of flags for determining data quality
• The ones to look for especially:
• Taxon match fuzzy
• Taxon match higherrank
• Coordinate rounded
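Screening for these flags can be sketched as below. The example assumes each record carries a list of issue codes under a hypothetical `issues` field and that the three flags appear in their upper-case code form; the records themselves are invented for illustration.

```python
# The three flags named above, in upper-case issue-code form (an assumption
# about how they appear in the data being screened).
FLAGS_OF_CONCERN = {
    "TAXON_MATCH_FUZZY",
    "TAXON_MATCH_HIGHERRANK",
    "COORDINATE_ROUNDED",
}

# Invented example records; "issues" is a hypothetical field name.
records = [
    {"key": 1, "issues": ["COORDINATE_ROUNDED"]},
    {"key": 2, "issues": []},
    {"key": 3, "issues": ["TAXON_MATCH_FUZZY", "GEODETIC_DATUM_ASSUMED_WGS84"]},
]


def split_by_flags(records, flags=FLAGS_OF_CONCERN):
    """Split records into (clean, flagged) by the flags of concern."""
    clean, flagged = [], []
    for rec in records:
        hits = flags.intersection(rec.get("issues", []))
        if hits:
            flagged.append(rec)
        else:
            clean.append(rec)
    return clean, flagged


clean, flagged = split_by_flags(records)
print([r["key"] for r in clean], [r["key"] for r in flagged])
```

Whether flagged records are dropped or only inspected depends on the analysis; a rounded coordinate may be harmless at a coarse spatial resolution.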
DATA SIZE
The size of the dataset you download can be substantial,
depending on the extent of your query. Ensure you have
the necessary storage and computational resources to
handle the data.
Consider using BigQuery or Apache Spark for
interactions with really big data
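Before reaching for BigQuery or Spark, a download that does not fit in memory can often be summarised by streaming it row by row. The sketch below assumes a tab-separated file with a header row; the column names and the in-memory sample are stand-ins for a real download.

```python
import csv
import io
from collections import Counter


def count_by_column(handle, column, delimiter="\t"):
    """Stream a delimited file row by row and tally the values of one column."""
    counts = Counter()
    reader = csv.DictReader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        counts[row[column]] += 1
    return counts


# Tiny in-memory stand-in for a large tab-separated download file.
sample = io.StringIO(
    "gbifID\tkingdom\tyear\n"
    "1\tAnimalia\t2020\n"
    "2\tPlantae\t2021\n"
    "3\tAnimalia\t2021\n"
)

counts = count_by_column(sample, "kingdom")
print(counts.most_common())
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` object; only one row is held in memory at a time.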
Accessing GBIF Data
GBIF translates traditional nomenclature into Operational Taxonomic Units (OTUs)
HOW ARE DATA INDEXED IN GBIF?
Domain (Eukarya)
Kingdom (Animalia)
Phylum (Chordata)
Class (Mammalia)
Order (Primates)
Family (Hominidae)
Genus (Homo)
Species (Homo sapiens)
PhyloCode
OTUs AND THE GBIF BACKBONE
Species Hypothesis (SH) numbers [DOI]
SH ABC0001
Barcode Index Number (BIN)
BIN DEF0002
GBIF backbone taxonomy
MACHINE-READABILITY REQUIRES PERSISTENT IDENTIFIERS
The purpose of identifiers is
… to name things
… making it possible to refer to them
• To uniquely identify something, it needs a persistent identifier (PID).
• A persistent identifier is globally unique, persistent, and resolvable.
• A PID is resolvable when it allows both human and machine users to access an object or its representation, and its Kernel Information.
• Kernel Information is a structured record that contains information (metadata) about the referred object, such as a pointer to the location where the data for the object can be found.
FAIR data is about machine-readable data
Researchers & museums need to do more than simply post their data on the web for it to be re-usable.
GBIF & IDENTIFIERS
https://www.gbif.org/occurrence/1095052193
Dataset: Vascular Plant Herbarium, Oslo (O) UiO
Publisher: University of Oslo
Catalogue number: 2007334
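The occurrence page above is the human-facing view of the record. A machine-facing counterpart can be derived from the same numeric key, as in this minimal sketch. It assumes the public GBIF API base `https://api.gbif.org/v1` and only builds the URL; nothing is fetched.

```python
from urllib.parse import urlparse

# Assumed base URL of the public GBIF API.
API_BASE = "https://api.gbif.org/v1"


def occurrence_key(portal_url):
    """Extract the numeric occurrence key from a gbif.org occurrence page URL."""
    parts = [p for p in urlparse(portal_url).path.split("/") if p]
    if len(parts) != 2 or parts[0] != "occurrence" or not parts[1].isdigit():
        raise ValueError(f"not an occurrence URL: {portal_url}")
    return int(parts[1])


def api_url(portal_url):
    """Build the machine-readable API URL for the same occurrence record."""
    return f"{API_BASE}/occurrence/{occurrence_key(portal_url)}"


print(api_url("https://www.gbif.org/occurrence/1095052193"))
```

The same key therefore serves both audiences: the portal page for people and a structured record for software.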
• GBIF trawling for data via GBIF Data Portal
• Search can be refined via filters
DISCOVERING DATA IN GBIF – THE DATA PORTAL
Dataset metadata
Species checklists
Sampling-event data
Occurrence-only data
Download through the GBIF Data Portal is a three-step process:
1. Select desired data
2. Stage download & wait for GBIF to finish processing
3. Download final product
DOWNLOADING GBIF DATA – THE DATA PORTAL
Discovery & Download takes four R function calls:
1. occ_search(…)
2. occ_download(…)
3. occ_download_get(…)
4. occ_download_import(…)
DISCOVERING & DOWNLOADING GBIF DATA – PROGRAMMATIC SOLUTIONS
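Behind the staging step sits a download request that describes the desired records as a set of filters. The sketch below, in Python rather than R, only assembles such a request body following the JSON layout of the GBIF occurrence download API as an assumption; it does not send anything. The user name, e-mail address, and taxon key are hypothetical placeholders.

```python
import json


def build_download_request(creator, email, taxon_key, country, fmt="SIMPLE_CSV"):
    """Assemble a download request body; all argument values are caller-supplied."""
    return {
        "creator": creator,
        "notificationAddresses": [email],
        "sendNotification": True,
        "format": fmt,
        "predicate": {
            "type": "and",
            "predicates": [
                {"type": "equals", "key": "TAXON_KEY", "value": str(taxon_key)},
                {"type": "equals", "key": "COUNTRY", "value": country},
                {"type": "equals", "key": "HAS_COORDINATE", "value": "true"},
            ],
        },
    }


# Hypothetical user, e-mail address, and taxon key, for illustration only.
request_body = build_download_request("example_user", "user@example.org", 1234567, "NO")
print(json.dumps(request_body, indent=2))
```

Submitting the request requires a GBIF account; the service then processes the query and reports when the file is ready to fetch.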
We store all download files as long as
possible.
The download metadata page will always
resolve, but the file itself might be removed
in the future.
We strive to store all downloads, but
prioritize downloads that have been cited.
Accrediting GBIF Data
HOW TO CITE DATA MEDIATED BY GBIF
DATA CITATION
1. Download data from GBIF.org
2. Receive a recommended citation with a download DOI
3. Cite the DOI in published research or other work
Example: GBIF.org (9 November 2021) GBIF Occurrence Download https://doi.org/10.15468/dl.xxxxxx
https://www.gbif.org/citation-guidelines #CiteTheDOI
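The citation pattern in the example can be reproduced from a download date and a DOI. This is a minimal sketch that mirrors the example string; the helper name is hypothetical, and the citation that GBIF itself provides with each download remains the authoritative one.

```python
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

DOI_PREFIX = "https://doi.org/"


def format_citation(download_date, doi):
    """Format a download citation in the pattern of the example above."""
    if doi.startswith(DOI_PREFIX):
        doi = doi[len(DOI_PREFIX):]
    when = f"{download_date.day} {MONTHS[download_date.month - 1]} {download_date.year}"
    return f"GBIF.org ({when}) GBIF Occurrence Download {DOI_PREFIX}{doi}"


# The DOI suffix is the placeholder from the example, not a real download.
print(format_citation(date(2021, 11, 9), "10.15468/dl.xxxxxx"))
```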
DOWNLOADS AND DATASETS ARE AUTOMATICALLY ASSIGNED DOIs
Citing the data download DOI will resolve to the dataset DOIs assigned for each dataset
contributing data records to the download set.
This way, all data publishers contributing data records will be accredited!
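How credit spreads across contributing datasets can be illustrated by tallying the records in a download by their source dataset. The sketch assumes each row carries a `datasetKey` column; the rows and keys below are invented (real keys are UUIDs).

```python
from collections import Counter

# Invented rows standing in for a download; real dataset keys are UUIDs.
rows = [
    {"gbifID": "1", "datasetKey": "dataset-a"},
    {"gbifID": "2", "datasetKey": "dataset-b"},
    {"gbifID": "3", "datasetKey": "dataset-a"},
    {"gbifID": "4", "datasetKey": "dataset-c"},
]


def records_per_dataset(rows):
    """Count how many records each source dataset contributed."""
    return Counter(row["datasetKey"] for row in rows)


contributions = records_per_dataset(rows)
for dataset_key, n_records in contributions.most_common():
    print(dataset_key, n_records)
```

Each key in the tally corresponds to one dataset, and therefore one publisher, that is credited when the download DOI is cited.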
INCENTIVE FOR DATA REUSE
To incentivize the sharing of useful data, the scientific
enterprise needs a well-defined system that links
individuals with reuse of data sets they generate
Pierce et al. Credit data generators for data reuse, Nature 6 June 2019
#CiteTheDOI
GBIF started issuing DOIs on 3 February 2015
Figure: identifiers along the data flow from source datasets to published research
Objects: Source datasets #1 to #3; GBIF download; Final state of data
Steps: Publish datasets in GBIF; Filter & download; Process & archive; Analyze & publish
DOIs: Dataset DOIs; Download DOI; Bibliographic DOI
Other identifiers: institutionID; collectionID; materialSampleID; identifiedByID
ROR for museums
ORCID for curators
DOI for datasets
(GRSciColl UUID for collections)
will enable the linking of museum collection specimens to scientific
literature and scientific actors (authors, curators, etc.)
Digital Object Identifier (DOI)
Open Researcher and Contributor ID (ORCID)
Research Organization Registry (ROR)
THANK YOU
www.gbif.org
Erik Kusch | erik.kusch@nhm.uio.no
Senior Engineer
Machine Readable Nature Research Group - MANA
Department of Research and Collections
Natural History Museum
University of Oslo